ABSTRACT



A statistical thesaurus is built dynamically, from the same
text collection that is being searched, allowing improved
generation of expanded query terms. The thesaurus is
dynamic in that thesaurus records are collected, ranked,
accessed, and applied dynamically. Thesaurus "records" are
actually formed as indexed documents arranged in
"collections". The collections are preferably distinguished based on
text source (court cases versus news wires versus patents,
and so forth). Each record has terms assembled in indexed
groups (or segments) which inherently reflect a ranking
based on relevance to an initial query. After an initial query
is received, the appropriate collection(s) of records may be
searched by a conventional search and retrieval engine, the
searches inherently returning records ranked by degree of
relevance due to the record indexing scheme. A record
ranking scheme avoids contamination of relevant records by
less relevant records. The record selection and the expansion
query term generation processes are each divided into
parallel threads. The separate threads correspond to respective
text sources to enable the improved expansion query term
generation to be provided in real time.

19 Claims, 10 Drawing Sheets 
STATISTICAL THESAURUS, METHOD OF FORMING SAME, AND USE THEREOF IN
QUERY EXPANSION IN AUTOMATED TEXT SEARCHING

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to the field of
automated search and retrieval of text documents. More
specifically, the invention relates to thesauri (especially
statistical thesauri), to the structures of the statistical
thesauri, to methods of forming the statistical thesauri, and
to use of the statistical thesauri in query expansion.

2. Related Art

It is known in the field of information retrieval that both
[illegible] of good [illegible] number of good search terms.

A statistical thesaurus is a thesaurus which contains terms
that are related to the headword by their co-occurrence with
the headword in text. This is in contrast to a traditional
thesaurus whose terms, synonyms, are related to the
headword by meaning.

Recent research has shown that a statistical thesaurus
provides good search terms when used for query expansion,
while traditional thesauri provide little improvement and
may actually hurt overall performance. As an example, FIG.
6 illustrates synonyms for the headword "murder" from a
traditional thesaurus, while FIG. 7 illustrates the related
concepts from a statistical thesaurus.

Statistical thesauri can also provide related concepts for
many terms not found in a traditional thesaurus, including
current events. For example, FIG. 8 illustrates the related
concepts for the term "Whitewater". This meaning of the
term "Whitewater" cannot be found in any traditional
thesaurus.

Therefore, a high performance statistical thesaurus is a
very useful tool in an information retrieval system. It is to
improving the formation, structure and use of statistical
thesauri that the present invention is directed.

SUMMARY OF THE INVENTION

The inventive statistical thesaurus provides a high degree
of performance, is scalable to multiple users and large
amounts of source information, and is tunable to specific
[illegible] source information. The thesaurus works best when
it is built from the text collection being searched. In order
to meet these requirements the inventive dynamic, parallel
thesaurus is provided.

A statistical thesaurus is built dynamically, from the same
text collection that is being searched, allowing improved
generation of expanded query terms. The thesaurus is
dynamic in that thesaurus records are collected, ranked,
accessed, and applied dynamically. Thesaurus "records" are
actually formed as indexed documents arranged in
"collections". The collections are preferably distinguished based on
text source (court cases versus news wires versus patents,
and so forth). Each record has terms assembled in indexed
groups (or segments) which inherently reflect a ranking
based on relevance to an initial query. After an initial query
is received, the appropriate collection(s) of records may be
searched by a conventional search and retrieval engine, the
searches inherently returning records ranked by degree of
relevance due to the record indexing scheme. A record
ranking scheme avoids contamination of relevant records by
less relevant records. The record selection and the expansion
query term generation processes are each divided into
parallel threads. The separate threads correspond to respective
text sources to enable the improved expansion query term
generation to be provided in real time.

More specifically, the invention provides a dynamic
statistical thesaurus including a collection of records which
[illegible] weighted term relationships. The statistical
thesaurus [illegible] into multiple indexed collections based on
[illegible] source material, and is searched interactively to
[illegible] a [illegible] of related concepts for one or more
expanded query terms.

The invention also provides a ranking method for
collecting [and] ranking term relationship records, and then
deriving a [illegible] of related [illegible]. [Illegible] ranking
method [illegible] determining the score. Furthermore, the
method [illegible] up to (for example) 100 records, but will use
as few as (for example) 25 records when a large change in score
occurs, or as few as (for example) 50 records when any
change in score occurs.

Moreover, the invention provides a statistical thesaurus
structure which stores the collection of logical term
relationship [illegible] record as a document in which the terms
are grouped by term weight in different indexed sections
(segments) of the document. This allows a conventional
document retrieval system to build the index, and create the
candidate set of records for ranking.

The invention further provides a method of parallel
processing which involves dividing the statistical thesaurus into
small physical collections, each with its own index,
searching the multiple collections simultaneously, and merging the
search results.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is better understood by reading the following
Detailed Description of the Preferred Embodiments with
reference to the accompanying drawing figures, in which
like reference numerals refer to like elements throughout,
and in which:

FIG. 1 illustrates a query expansion process involving use
of a statistical thesaurus.

FIG. 2 illustrates a preferred process for forming a
statistical thesaurus.

FIG. 3 illustrates a preferred record selection method.

FIG. 4 illustrates a preferred term extraction process of
the retrieval method.

FIG. 5 illustrates various parallel phases in a preferred
query expansion method according to the present invention.

FIG. 6 illustrates synonyms for the headword "murder"
from a traditional (meaning-based) thesaurus.

FIG. 7 illustrates related concepts from a statistical
(co-occurrence-based) thesaurus using news-based material.

FIG. 8 illustrates related concepts for the term
"Whitewater" resulting from use of a statistical thesaurus.

FIG. 9 shows related concepts from a statistical thesaurus
using GENFED (legal material) searches.

FIG. 10A illustrates an exemplary indexing scheme in a
dictionary for a given collection, showing entries including
a term in association with references to a document and a set
of "groups" which reflect ranking of terms based on
relevance.

FIG. 10B illustrates an exemplary "Top 100" list of
ranked records, showing a "collection" number (based on
text source), a document number, and a score (based on
rankings determined by group within a record).

FIG. 11A illustrates an exemplary work file record which
is used in processing described with reference to FIGS. 4
and 5.

FIG. 11B illustrates a "Top 26" Terms Table showing 
terms ordered by frequency of occurrence. 

FIG. 12 shows a Collection Map illustrating a list of 
possible text sources (court cases, news wires, patents) in 
conjunction with respective numbers of collections present 
and lists of those collections. 

FIG. 13 shows a process of selecting collections on which 
to base subsequent query expansion terms. 

FIG. 14 illustrates an exemplary hardware platform on
which the inventions related to the formation, storage, and 
use of the statistical thesaurus may be implemented. 



FIG. 15 illustrates the relationship of the collections, text
sources, threads, records (records correspond to documents),
groups, and terms, according to a preferred embodiment of
the statistical thesaurus according to the present invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

In describing preferred embodiments of the present
invention illustrated in the drawings, specific terminology is
employed for the sake of clarity. However, the invention is
not intended to be limited to the specific terminology so
selected, and it is to be understood that each specific element
includes all technical equivalents which operate in a similar
manner to accomplish a similar purpose.

A static thesaurus is built once, and accessed many times.
For example, a standard thesaurus is published in the form
of a book, with a fixed set of synonyms for each headword.
All of the work is done before the text is published.

For a statistical thesaurus, the related terms for a
headword vary depending on the source text collection being
searched, and over time as new material is added to the
collection. Rebuilding a static list of related terms and
headwords would be very computation-intensive and time
consuming, limiting the ability to tune the thesaurus by
source text collection and keep it current.

As examples of a collection-specific statistical thesaurus,
reference is made to FIG. 7, which is based on news
material, and FIG. 9, which is based on legal material.

Query expansion takes place in real time, while the user
waits at a terminal. Response time must therefore be very
short and consistent. The present invention provides
consistently short response time by processing an expansion
request using parallel operations distinguished by the source
text collection.

FIG. 1 illustrates a query expansion process. The process
begins when the end user of the system enters a search query
including one or more terms. The user may select to expand
the query, as a whole for statistical retrieval, or by specific
term or terms for boolean retrieval. If the user specifies
query expansion, the list of terms to be used for expansion
is constructed, and the statistical thesaurus is accessed.

Unlike a traditional thesaurus, all requested terms are
processed together to provide a single set of related
concepts. The related concepts are then displayed to the user
who can select which, if any, of the concepts to include
within the query. The query can then be expanded again, or
run by the user.

The statistical thesaurus' structure and content, as well as
its method of construction, are first described. Then, a
method of retrieval of related concepts using the thesaurus
will be described.

To summarize terminology, FIG. 15 illustrates the
hierarchical relationship of the following terms according to a
preferred embodiment of a statistical thesaurus according to
the present invention:

A. Collections form the statistical thesaurus
   Text Sources are the basis of respective collections
   Threads in software can form and search respective collections
B. Records in each collection are based on respective documents
C. Groups of terms are found in each record
D. Terms can include one or more words.



The thesaurus includes plural collections, each collection 
being based on a respective text source (such as legal
opinions, news stories, patent text, and so forth). The various 
collections are generated and searched in parallel, by respec- 
tive (concurrently-executed) threads of a computer program. 
The collections include records. The records include groups 
of terms. The groups have weights (such as 1, 2, 3, 4 or 5) 
that constitute an indexing scheme that allows the user to 
interactively search the collections to generate query expan- 
sion terms. 
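
The per-collection threading described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the dictionary layout of a collection, the function names `search_collection` and `search_all`, and the use of a simple matched-term count in place of the weighted scoring are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_collection(collection, query_terms):
    """Search one collection; return (collection name, record id, matched count).

    Assumed layout: {"name": str, "records": {record_id: {group: [terms]}}}.
    """
    wanted = set(query_terms)
    hits = []
    for record_id, groups in collection["records"].items():
        terms = {term for group_terms in groups.values() for term in group_terms}
        matched = len(terms & wanted)
        if matched:
            hits.append((collection["name"], record_id, matched))
    return hits


def search_all(collections, query_terms):
    """Run one thread per collection, then merge the per-collection results."""
    workers = max(1, len(collections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_collection = pool.map(
            lambda c: search_collection(c, query_terms), collections
        )
        merged = [hit for hits in per_collection for hit in hits]
    # Most matched terms first; ties broken by collection name and record id.
    return sorted(merged, key=lambda h: (-h[2], h[0], h[1]))
```

One thread per collection mirrors the text's point that a thread waiting on disk I/O for one collection does not block work on another.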

The statistical thesaurus is a set of records, with each 
record containing a set of terms which are related to each 
other by their occurrence together in a body of text such as 
a document. The preferred embodiment of the invention
designates five groups of terms in each record: Group 1 
contains the most important terms from the body of text; 
group 2, the next most important terms; and so forth, through 
group 5 which contains the least important terms (although 
group 5 terms are still meaningful concepts within the body 
of text). These groupings in the document inherently reflect 
term weights for use in ranking the records during retrieval. 
The record may be generated by processing a body of text, 
and by extracting the important terms and phrases based on 
statistics using a suitable phrase recognition method such as 
that disclosed in application Ser. No. 08/589,468 which is
incorporated herein by reference. 
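
As a rough sketch of the record layout just described, a record can be modeled as five ordered term groups. The class name `ThesaurusRecord` and its methods are hypothetical, introduced only for illustration; how terms are assigned to groups (the phrase recognition and statistics step) is outside the scope of this sketch.

```python
from dataclasses import dataclass, field

NUM_GROUPS = 5  # the preferred embodiment designates five groups per record


@dataclass
class ThesaurusRecord:
    """One thesaurus record: terms from one body of text, split into groups."""

    record_id: int
    # Group 1 holds the most important terms, group 5 the least important.
    groups: dict = field(
        default_factory=lambda: {g: [] for g in range(1, NUM_GROUPS + 1)}
    )

    def add_term(self, term, group):
        """Place a term (one or more words) into the given group."""
        if not 1 <= group <= NUM_GROUPS:
            raise ValueError(f"group must be between 1 and {NUM_GROUPS}")
        self.groups[group].append(term)

    def group_of(self, term):
        """Return the most important group containing the term, or None."""
        for group in sorted(self.groups):
            if term in self.groups[group]:
                return group
        return None
```

For example, a record built from a short court opinion might hold "felony murder rule" in group 1 and "habeas" in group 3, leaving groups 4 and 5 empty.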

The statistical thesaurus for a given text collection is built 
by generating the records for a sampling of the documents 
within the collection. The sampling rate varies by collection 
type and size, with 100% being appropriate for small col- 
lections and as little as 10% for very large collections. 
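
A minimal sketch of sampling a collection at a chosen rate is shown below. The function name `sample_documents`, the use of uniform random sampling, and the fixed seed are assumptions; the text states only that the rate ranges from 100% for small collections down to as little as 10% for very large ones.

```python
import random


def sample_documents(doc_ids, rate, seed=0):
    """Return a sorted sample of document ids at the given rate (0 < rate <= 1)."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in the interval (0, 1]")
    docs = list(doc_ids)
    if not docs or rate == 1.0:
        return docs  # small collections are sampled in full
    rng = random.Random(seed)  # seeded so that a rebuild is repeatable
    count = min(len(docs), max(1, round(len(docs) * rate)))
    return sorted(rng.sample(docs, count))
```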

Significantly, the records are then grouped by the collec- 
tions from which they were sampled. That way, the appro- 
priate set of records can be accessed based on the collection 
selected by the user. For example, a first set of records may 
be formed based on federal case law documents, and a 
second set of records may be formed based on news wires.
When a user later searches case law, the first set of records 
is used for the statistical thesaurus. When the user is 
searching news material, the second set is used. 
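
The grouping of records by the collection they were sampled from can be sketched with a simple mapping. The names `group_by_source` and `records_for_search`, and the source labels used in the example, are illustrative assumptions.

```python
from collections import defaultdict


def group_by_source(records):
    """Group (source, record_id) pairs into {source: [record_id, ...]}."""
    by_source = defaultdict(list)
    for source, record_id in records:
        by_source[source].append(record_id)
    return dict(by_source)


def records_for_search(by_source, selected_source):
    """Return the record set matching the collection the user is searching."""
    return by_source.get(selected_source, [])
```

With records tagged as ("case law", 1), ("news", 2), and ("case law", 3), a case law search would draw on records 1 and 3, while a news search would draw on record 2 only.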

FIG. 2 illustrates a preferred process for forming a sta- 
tistical thesaurus. 
First, source documents are read, the valuable terms and
phrases from the documents are extracted, and thesaurus
"records" are written. The thesaurus records are essentially
documents having a set of (for example) five groups (or
document segments), each group inherently reflecting a
ranking of the terms in the group.

The following is an example of a thesaurus "record" in a
preferred embodiment, in which "murder" was the original
query term:

GROUP1:
@ felony murder rule @ @ felony murder @
GROUP2:
[one or more lines missing in source]
murder @ @ malice instruction @ @ degree murder @ @
habeas @

This simple example of a record contains terms
only in the first three groups, due to the small size of the
source document (the opinion in Davis v. State of Tennessee
and Larry Lack, 856 F.2d 35; 1988 U.S. App. LEXIS 11941
(CA 6, 1988)). The "@" signs signal the beginning and end
of terms to clearly delimit phrases. Of course, variations on
this format lie within the contemplation of the present
invention.

The thesaurus records are then processed to build a
statistical thesaurus index and to build compressed records
which are optimized for use in later retrieval operations.

FIG. 10A illustrates an exemplary indexing scheme in a
dictionary for a given collection, showing entries including
a term in association with references to a document and a set
of "groups" which reflect ranking of terms based on
relevance.

The index is a typical inverted text index, as known to
those skilled in the art and as described by Salton in his text,
Automatic Text Processing. Each term that appears in any
record appears in the index with a list of records in which it
appears.

Furthermore, the index also specifies which term group
the term is in. Each record can be thought of as a document.
Each group can be thought of as a sub-portion (or segment)
of the document such as a paragraph. The records are
grouped by their source collection type (legal or news), and
exist in many different physical collections. A physical
collection has its own index file and compressed text file.

FIG. 5 is a high-level illustration of the relationship of the
processes of FIG. 13, FIG. 3, and FIG. 4. FIG. 13, FIG. 3,
and FIG. 4 are now described.

FIG. 13 illustrates a preliminary process of selecting a
collection which is to be searched in determining suitable
query expansion terms. This process allows later processes
to focus on a most appropriate source of phrases (case law,
news wires, and so forth). First, the user-selected source is
determined, and the source is looked up in a source map. A
list of text source collections to be searched is then output.
At this point, the processing illustrated in FIGS. 3 and 4 can
take place.

FIG. 12 shows a Collection Map illustrating a list of
possible text sources (court cases, news wires, patents) in
conjunction with respective numbers of collections present
and lists of those collections. The Collection Map is
referenced to select a suitable collection.

FIG. 3 illustrates a preferred record selection method.
After one or more terms have been specified by the user, this
method involves selection of a set of records for those terms.
The method preferably has two embodiments, one for
boolean queries and one for statistical queries. The boolean
version is tuned for very high precision and a very small
number of input terms, while the statistical query version is
designed for a larger number of input terms.

The first phase of the retrieval involves accessing the
index for the provided terms from the statistical thesaurus
index. For boolean queries, the terms are "AND"ed together;
the statistical query terms are "OR"ed together. In this
phase (which may be called a "resolve" phase), the specific
locations within the records for the query terms are read
from the index, and merged as necessary ("OR"ed or
"AND"ed multiple terms). The output from the merge
operation is a list of records, and term locations within the
record.

[text missing in source] group 2 terms score 13 points, and so
forth through group 5 terms which score 10 points. The
maximum score for a record is 14 times the number of query
terms. The minimum score is 10 times the number of query
terms for boolean queries, and 10 for statistical queries.

After each record is scored, it is potentially inserted into
a "Top 100" list, an illustrative example of which is shown
in FIG. 10B. This list contains the highest-scoring 100
records encountered so far. The list is ordered by score, with
the last entry being the highest-scoring entry. The current
record is added to the list at the appropriate place, or
discarded if it doesn't score high enough to make the list.
The record at this point is a record number and a score.

When all records have been processed, the resolve phase
is completed.

The list is then passed to find an ideal cutoff. Beginning
with the last entry (the highest scoring entry), the list is
processed in reverse. After 25 entries, if the score changes by
more than 10 points, the list is cut between these two
records. After 50 entries, the list is cut between any change
in score. This cutoff routine tends to prevent contamination
of good entries by substantially worse entries.

As illustrated in FIG. 4, after the list is cut, the term
extraction portion of the retrieval takes place. For each
record in the list, the compressed text file is accessed to read
the complete text of the record. The terms in the record are
then extracted and written to a work file.

An illustrative example of a work file is shown in FIG.
11A. It is understood that a work file is produced for each
parallel thread. After all parallel threads have terminated, the
results of the various work files are merged. The parallel
thread implementation is described in greater detail with
reference to FIG. 5.

If a term is a complete match with a query term, it is not
written as it is already known to the user. For example, if
"Clinton" and "Whitewater" are already the query terms, the
phrase "Bill Clinton" would be written, but the single term
"Whitewater" or the single term "Clinton" would not be
written.

When all the records have been read and the terms have
been written, the term extraction phase is completed, and the
sort phase begins.

The sort phase sorts the terms in the work file, and outputs
the terms in alphabetical order. Multiple occurrences of the
same term are now output consecutively. As they are output,
the frequency of each term is calculated. A table of the top
26 terms, ranked by frequency, is maintained. FIG. 11B
illustrates an exemplary "Top 26" Terms Table showing
terms ordered by frequency of occurrence. In the preferred
embodiment, a term must have a minimum frequency of 2
to be inserted into the table. After all sorted terms have been
output, the Term Table is output to the end user.
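
The scoring, cutoff, and term table steps of the retrieval method described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the patented implementation: all function names are hypothetical; the group 1 value of 14 points is inferred from the stated maximum of 14 times the number of query terms; the ranked list is held highest-first rather than highest-last; and "after 25 entries" and "after 50 entries" are read as "once that many entries have been kept".

```python
import heapq
from collections import Counter

# Points per group; group 1 = 14 is inferred from the stated maximum score.
GROUP_POINTS = {1: 14, 2: 13, 3: 12, 4: 11, 5: 10}


def score_record(groups, query_terms):
    """Sum the group points of each query term found in {group: [terms]}."""
    score = 0
    for term in query_terms:
        for group in sorted(groups):
            if term in groups[group]:
                score += GROUP_POINTS[group]
                break
    return score


def top_records(scored, limit=100):
    """Keep the highest-scoring (record_id, score) pairs, best first."""
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


def apply_cutoff(ranked, loose_after=25, strict_after=50, big_gap=10):
    """Cut a best-first list at a large score drop, or later at any drop."""
    for kept in range(1, len(ranked)):
        drop = ranked[kept - 1][1] - ranked[kept][1]
        if kept >= strict_after and drop > 0:
            return ranked[:kept]
        if kept >= loose_after and drop > big_gap:
            return ranked[:kept]
    return ranked


def term_table(record_terms, query_terms, size=26, min_freq=2):
    """Count extracted terms, skipping exact query terms; keep the top entries."""
    known = set(query_terms)
    counts = Counter(
        term for terms in record_terms for term in terms if term not in known
    )
    rows = [(term, n) for term, n in counts.items() if n >= min_freq]
    rows.sort(key=lambda row: (-row[1], row[0]))  # frequency, then alphabetical
    return rows[:size]
```

A counter replaces the sort-then-count pass over the work file described in the text; the resulting frequencies are the same, and the alphabetical tie-break is an added assumption.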
The above method dynamically builds the list of related
terms for any combination of query terms. Additionally, the
collection of records may be supplemented at any time with
only a small update to the index file.

FIG. 5 illustrates various parallel phases in a preferred
query expansion method according to the present invention.

In order to achieve maximum performance, several
phases in the above process are performed in parallel. A
parallel thread is created for each physical collection being
searched. The statistical thesaurus is preferably built in
several small collections instead of a single large collection.
Preferably, these collections are based on the source of text,
such as case law or news wires.

As shown in FIG. 5, after the number of collections has
been determined, a unique collection thread is spawned for
each collection. These threads execute concurrently. That is,
while one thread waits on a physical disk I/O, another thread
may be processing. This parallelism allows the inventive
process to be easily dividable across processors.

Also, as described with reference to FIG. 4, the reading of
compressed records, and the extraction and writing of terms
and phrases to a work file, are also performed in parallel,
each of the parallel paths being distinguished by the source
text collection.

Thus, the invention provides a statistical thesaurus built
from multiple information sources. Significantly, the
statistical thesaurus is managed so that many different
combinations of records may be searched to correspond to the
collection specified by the end user.

As a particular example of the advantages of forming
query expansion based on source text collection, a user of
the LEXIS-NEXIS™ system may select the library
GENFED (which contains federal case law), the library NEWS
(which contains news media documents), or the library
PATENT (which contains the full text of U.S. patents).
These are examples of the source text collections mentioned
above. If the user is searching in GENFED and the topic is
MURDER, the related concepts provide better search
performance if they are derived from federal case law.
Conversely, the news media search would work better if the
term is expanded using records generated from news
documents. The difference in terms is clearly illustrated in FIGS.
7 and 9: FIG. 7 shows related concepts for NEWS searches,
while FIG. 9 shows related concepts for GENFED searches.
This process is managed by sampling document
collections individually, and then maintaining the generated term
records in separate collections. Then, significantly, the term
record collections are combined dynamically based upon the
document collection being searched by the end user.

A hardware environment in which the inventive thesaurus
may be developed, stored and used is shown in FIG. 14. In
particular, a document search and retrieval system 30 is
shown. The system allows a user to search a subset of a
plurality of documents for particular key words or phrases.
The system then allows the user to view documents that
match the search request. The system 30 comprises a
plurality of Search and Retrieval (SR) computers 32-35
connected via a high speed interconnection 38 to a plurality of
Session Administrator (SA) computers 42-44.

Each of the SR's 32-35 is connected to one or more
document collections 46-49, each containing text for a
plurality of documents, indexes therefor, and other ancillary
data. More than one SR can access a single document
collection. Also, a single SR can be provided access to more
than one document collection. The SR's 32-35 can be
implemented using a variety of commercially available
computers well known in the art, such as Model EX 100
manufactured by Hitachi Data Systems of Santa Clara, Calif.

Each of the SA's 42-44 is provided access to data
representing phrase and thesaurus dictionaries 52-54. The
SA's 42-44 can also be implemented using a variety of
commercially available computers, such as Models 5990
and 5995 manufactured by Amdahl Corporation of
Sunnyvale, Calif. The interconnection 38 between the SR's
and the SA's can be any one of a number of two-way
high-speed computer data interconnections well known in
the art, such as the Model 7200-DX manufactured by
Network Systems Corporation of Minneapolis, Minn.

Each of the SA's 42-44 is connected to one of a plurality
of front end processors 56-58. The front end processors
56-58 provide a connection of the system 30 to one or more
commonly available networks 62 for accessing digital data,
such as an X.25 network, long distance telephone lines,
and/or SprintNet. Connected to the network 62 are plural
terminals 64-66 which provide users access to the system
30. Terminals 64-66 can be dumb terminals which simply
process and display data inputs and outputs, or they can be
one of a variety of readily available stand-alone computers,
such as IBM or IBM-compatible personal computers. The
front end processors 56-58 can be implemented by a variety
of commercially available devices, such as Models 4745 and
4705 manufactured by the Amdahl Corporation of
Sunnyvale, Calif.

The number of components shown in FIG. 14 are for
illustrative purposes only. The system 30 described herein
can have any number of SA's, SR's, front end processors,
etc. Also, the distribution of processing described herein
may be modified and may in fact be performed on a single
computer without departing from the spirit and scope of the
invention.

A user wishing to access the system 30 via one of the
terminals 64-66 will use the network 62 to establish a
connection, by means well known in the art, to one of the
front end processors 56-58. The front end processors 56-58
handle communication with the user terminals 64-66 by
providing output data for display by the terminals 64-66 and
by processing terminal keyboard inputs entered by the user.
The data output by the front end processors 56-58 includes
text and screen commands. The front end processors 56-58
support screen control commands, such as the commonly
known VT100 commands, which provide screen
functionality to the terminals 64-66 such as clearing the screen and
moving the cursor insertion point. The front end processors
56-58 can handle other known types of terminals and/or
stand-alone computers by providing appropriate commands.

Each of the front end processors 56-58 communicates
bidirectionally, by means well known in the art, with its
corresponding one of the SA's 42-44. It is also possible to
configure the system, in a manner well known in the art,
such that one or more of the front end processors can
communicate with more than one of the SA's 42-44. The
front end processors 56-58 can be configured to "load
balance" the SA's 42-44 in response to data flow patterns.
The concept of load balancing is well known in the art.

Each of the SA's 42-44 contains an application program
that processes search requests input by a user at one of the
terminals 64-66 and passes the search request information
onto one or more of the SR's 32-35 which perform the
search and return the results, including the text of the
documents, to the SA's 42-44. The SA's 42-44 provide the
user with text documents corresponding to the search results
via the terminals 64-66. For a particular user session (i.e. a
single user accessing the system via one of the terminals
64-66), a single one of the SA's 42-44 will interact with a
user through an appropriate one of the front end processors
56-58.
The collection selection method (FIG. 13) may be
executed in either the session administrator SA computers
42-44 or in the search and retrieval computers 32-35. The
remainder of the methods described above (FIGS. 3, 4) are
preferably executed on the search and retrieval computers
32-35.

Of course, the inventions related to the formation, storage,
and application of the statistical thesaurus may be
implemented on [illegible] variety of computer platforms
[illegible]. [Illegible] above-described
embodiments of the present invention are possible, as
appreciated by those skilled in the art in light of the above
teachings. It is therefore to be understood that, within the
scope of the appended claims and their equivalents, the
invention may be practiced otherwise than as specifically
described.

What is claimed is:

1. A system including a dynamic statistical thesaurus for
use in interactively generating query expansion terms for use
with an automated text document search and retrieval
system, the system comprising:
  a) means for receiving at least one search query term;
  b) a plurality of collections of records, wherein:
    b1) each record in a collection corresponds to a
      respective [document];
    b2) each record in a collection has term groups
      addressable by an indexing scheme;
    b3) the collections are distinguished from each other
      based on respective text sample sources; and
    b4) the term groups have different weights constituting
      part of the indexing scheme; and
  c) means for using the indexing scheme to allow a user to
    interactively search the plurality of collections to
    generate the query expansion terms to supplement the at
    least one search query term.

2. The system of claim 1, wherein:
  the search query term includes plural words constituting
    a phrase.

3. The system of claim 1, wherein:
  the collections in the plurality of collections are searched
    concurrently using respective threads in parallel as
    subsets of a larger set of physically distributed
    collections.

4. The system of claim 1, wherein the text sample sources
include:
  a text sample source consisting essentially of text of court
    case legal opinions.

5. The system of claim 1, wherein the text sample sources
include:
  a text sample source consisting essentially of text of news
    media documents.

6. The system of claim 1, wherein the text sample sources
include:
  a text sample source consisting essentially of text of
    patent documents.

7. The system of claim 1, wherein the text sample sources
include:
  a first text sample source consisting essentially of text of
    court case legal opinions;
  a second text sample source consisting essentially of text
    of news media documents; and
  a third text sample source consisting essentially of text of
    patent documents.

8. A ranking method for ranking term relationship records
to allow a set of related concepts to be derived for query
expansion, the ranking method comprising:
  a) assigning term weights for respective query terms;
  b) summing [illegible] to determine a [illegible];
  c) discovering gaps that occur between scores of records
    from among a plurality of records that are being scored;
  d) assigning, to the scores of the plurality of records being
    scored, a cutoff score based on the discovered gaps in
    the scores; and
  e) accepting only records whose scores exceed the cutoff
    score; wherein the assigning step includes:
    d1) [illegible] the cutoff [illegible] in the accepting
      step, [illegible] records [illegible] accepted
      [illegible] a large gap in [illegible].

9. The ranking method of claim 8, wherein the assigning
step includes:
  assigning individual term weights to respective individual
    records that correspond to respective individual
    documents.

10. The ranking method of claim 8, wherein the assigning
step includes:
  assigning term weights from among a set of only about
    five possible term weights.

11. A statistical thesaurus, comprising:
  a plurality of collections of indexed records; [illegible]
  1) the records constitute respective documents;
  2) each document includes plural terms that are grouped
    by weight into different groups within the document;
    [illegible]
  3) the groups are indexed so as to allow the records to be
    searched by a conventional text document search and
    retrieval system so as to perform functions of:
    i) adding records to the indexed records and/or
    ii) forming a list of related concepts for possible
      inclusion in expansion query terms.

12. The statistical thesaurus of claim 11, wherein
  each document includes plural terms that are grouped by
    weight, from among about five possible different
    weights, into about five different respective groups
    within the document.

13. A query expansion method in a text document search
and retrieval system using a statistical thesaurus to generate
expansion query terms, the method comprising:
  dividing the statistical thesaurus into multiple small
    physical collections, each physical collection having its
    own index;
  searching the multiple collections in parallel, using the
    respective indexes; and
  merging the search results to form a list of related
    concepts to be included in the expansion query terms.

14. The method of claim 13, wherein the dividing step
includes:
  dividing the statistical thesaurus into small physical
    collections distinguished based on mutually different text
    sample sources.

15. The system of claim 14, wherein the text sample
sources include:
  a text sample source consisting essentially of text of court
    case legal opinions.

16. The system of claim 14, wherein the text sample
sources include:
  a text sample source consisting essentially of text of news
    media documents.

17. The system of claim 14, wherein the text sample
sources include:
  a text sample source consisting essentially of text of
    patent documents.

18. The system of claim 14, wherein the text sample
sources include:
  a first text sample source consisting essentially of text of
    court case legal opinions;
  a second text sample source consisting essentially of text
    of news media documents; and
  a third sample source consisting essentially of text of
    patent documents.

19. The method of claim 13, wherein the searching step
includes:
  searching the multiple collections using threads in
    software.

* * * * *



